AITopics | Coyoacan

Coyoacan

Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation

Long, Yunbo, Xu, Liming, Brintrup, Alexandra

arXiv.org Artificial Intelligence, Feb-6-2025

To evaluate the fidelity of synthetic tabular data, numerous metrics have been proposed to assess accuracy and diversity, including both low-order statistics (e.g., Density Estimation and Correlation Score (Zhang et al., 2023), Average Coverage Scores (Zein & Urvoy, 2022)) and high-order statistics (e.g., α-Precision and β-Recall (Alaa et al., 2022)). However, these metrics operate at a high level and fail to evaluate whether synthetic data preserves logical relationships, such as hierarchical or semantic dependencies between features. This highlights the need for a more fine-grained, context-aware evaluation of multivariate dependencies. To address this, we propose three evaluation metrics: Hierarchical Consistency Score (HCS), Multivariate Dependency Index (MDI), and Distributional Similarity Index (DSI). To assess the effectiveness of these metrics in quantifying inter-column relationships, we select five representative tabular data generation methods from different categories for evaluation. Their performance is measured using both existing and our proposed metrics on a real-world dataset rich in logical consistency and dependency constraints. Experimental results validate the effectiveness of our proposed metrics and reveal the limitations of existing approaches in preserving logical relationships in synthetic tabular data. Additionally, we discuss potential pathways to better capture logical constraints within joint distributions, paving the way for future advancements in synthetic tabular data generation.
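
The kind of hierarchical dependency the abstract mentions can be illustrated with a small check. The sketch below is not the paper's HCS definition, which the abstract does not give; it is a hypothetical stand-in that learns which parent values co-occur with each child value in the real table and reports the fraction of synthetic rows that respect those pairings. The function name `hierarchy_consistency` and the column names are assumptions made for illustration.

```python
def hierarchy_consistency(real_rows, synth_rows, child, parent):
    """Fraction of synthetic rows whose (child, parent) pair occurs in the real data.

    Hypothetical illustration only; not the HCS metric from the paper.
    Rows are dicts mapping column name to value.
    """
    # Collect the parent values observed for each child value in the real table.
    allowed = {}
    for row in real_rows:
        allowed.setdefault(row[child], set()).add(row[parent])

    if not synth_rows:
        return 0.0

    # A synthetic row is consistent if its parent was seen with its child.
    consistent = sum(
        1 for row in synth_rows if row[parent] in allowed.get(row[child], set())
    )
    return consistent / len(synth_rows)


real = [
    {"city": "Cambridge", "country": "UK"},
    {"city": "Tokyo", "country": "Japan"},
]
synthetic = [
    {"city": "Cambridge", "country": "UK"},
    {"city": "Tokyo", "country": "UK"},      # violates the city -> country hierarchy
    {"city": "Tokyo", "country": "Japan"},
    {"city": "Paris", "country": "France"},  # city never seen in the real data
]
score = hierarchy_consistency(real, synthetic, child="city", parent="country")
print(score)
```

A row-level check like this is sensitive to violations that column-wise statistics can miss, since each column's marginal distribution may look plausible even when the pairing is impossible.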

machine learning, natural language, tabular data, (17 more...)

arXiv:2502.04055

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Southeast Asia (0.06)
(49 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.88)

Noncommutative Model Selection and the Data-Driven Estimation of Real Cohomology Groups

Guzmán-Tristán, Araceli, Rieser, Antonio, Velázquez-Richards, Eduardo

arXiv.org Artificial Intelligence, Nov-29-2024

We propose three completely data-driven methods for estimating the real cohomology groups $H^k (X ; \mathbb{R})$ of a compact metric-measure space $(X, d_X, \mu_X)$ embedded in a metric-measure space $(Y,d_Y,\mu_Y)$, given a finite set of points $S$ sampled from a uniform distribution $\mu_X$ on $X$, possibly corrupted with noise from $Y$. We present the results of several computational experiments in the case that $X$ is embedded in $\mathbb{R}^n$, where two of the three algorithms performed well.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv:2411.19894

Country:

North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > Mexico > Guanajuato (0.04)
North America > United States > South Carolina > Richland County > Columbia (0.04)
(7 more...)

Genre: Research Report (0.82)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Sustainable Visions: Unsupervised Machine Learning Insights on Global Development Goals

García-Rodríguez, Alberto, Núñez, Matias, Pérez, Miguel Robles, Govezensky, Tzipe, Barrio, Rafael A., Gershenson, Carlos, Kaski, Kimmo K., Tagüeña, Julia

arXiv.org Artificial Intelligence, Sep-18-2024

The United Nations 2030 Agenda for Sustainable Development outlines 17 goals to address global challenges. However, progress has been slower than expected and, consequently, there is a need to investigate the reasons behind this. In this study, we used a novel data-driven methodology to analyze data from 107 countries (2000-2022) using unsupervised machine learning techniques. Our analysis reveals strong positive and negative correlations between certain SDGs. The findings show that progress toward the SDGs is heavily influenced by geographical, cultural and socioeconomic factors, with no country on track to achieve all goals by 2030. This highlights the need for a region-specific, systemic approach to sustainable development that acknowledges the complex interdependencies of the goals and the diverse capacities of nations. Our approach provides a robust framework for developing efficient and data-informed strategies to promote cooperative and targeted initiatives for sustainable progress.

artificial intelligence, machine learning, sdg, (17 more...)

arXiv:2409.12427

Country:

South America > Uruguay (0.04)
North America > Mexico > Mexico City > Coyoacan (0.04)
North America > Haiti (0.04)
(101 more...)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Design and analysis of tweet-based election models for the 2021 Mexican legislative election

Vigna-Gómez, Alejandro, Murillo, Javier, Ramirez, Manelik, Borbolla, Alberto, Márquez, Ian, Ray, Prasun K.

arXiv.org Artificial Intelligence, Jun-21-2023

Modelling and forecasting real-life human behaviour using online social media is an active endeavour of interest in politics, government, academia, and industry. Since its creation in 2006, Twitter has been proposed as a potential laboratory that could be used to gauge and predict social behaviour. During the last decade, the user base of Twitter has been growing and becoming more representative of the general population. Here we analyse this user base in the context of the 2021 Mexican Legislative Election. To do so, we use a dataset of 15 million election-related tweets in the six months preceding election day. We explore different election models that assign political preference to either the ruling parties or the opposition. We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods. These results demonstrate that analysis of public online data can outperform conventional polling methods, and that political analysis and general forecasting would likely benefit from incorporating such data in the immediate future. Moreover, the same Twitter dataset with geographical attributes is positively correlated with results from official census data on population and internet usage in Mexico. These findings suggest that we have reached a period in time when online activity, appropriately curated, can provide an accurate representation of offline behaviour.

mexico, tweet, twitter, (16 more...)

doi: 10.1140/epjds/s13688-023-00401-w

arXiv:2301.00626

Country:

North America > Mexico > Estado de México (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > Mexico > Mexico City > Mexico City (0.06)
(17 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Services (1.00)
Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses

Asoodeh, Shahab, Diaz, Mario

arXiv.org Artificial Intelligence, May-16-2023

The Noisy-SGD algorithm is widely used for privately training machine learning models. Traditional privacy analyses of this algorithm assume that the internal state is publicly revealed, resulting in privacy loss bounds that increase indefinitely with the number of iterations. However, recent findings have shown that if the internal state remains hidden, then the privacy loss might remain bounded. Nevertheless, this remarkable result heavily relies on the assumption of (strong) convexity of the loss function. It remains an important open problem to further relax this condition while proving similar convergent upper bounds on the privacy loss. In this work, we address this problem for DP-SGD, a popular variant of Noisy-SGD that incorporates gradient clipping to limit the impact of individual samples on the training process. Our findings demonstrate that the privacy loss of projected DP-SGD converges exponentially fast, without requiring convexity or smoothness assumptions on the loss function. In addition, we analyze the privacy loss of regularized (unprojected) DP-SGD. To obtain these results, we directly analyze the hockey-stick divergence between coupled stochastic processes by relying on non-linear data processing inequalities.
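
The update described above (gradient clipping, added noise, and a projection step) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm or analysis: the constraint set is assumed to be a Euclidean ball centred at the origin, the Gaussian noise is added to the averaged clipped gradient, and the names `clip`, `projected_dp_sgd_step`, `clip_norm`, `noise_std`, and `radius` are hypothetical.

```python
import math
import random


def clip(vec, max_norm):
    """Rescale vec so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [x * max_norm / norm for x in vec]


def projected_dp_sgd_step(w, per_sample_grads, lr, clip_norm, noise_std, radius, rng):
    """One illustrative step: clip per-sample gradients, average, add noise, project."""
    # Clipping bounds the influence of any single sample on the update.
    clipped = [clip(g, clip_norm) for g in per_sample_grads]
    n = len(clipped)
    avg = [sum(g[i] for g in clipped) / n for i in range(len(w))]

    # Independent Gaussian noise per coordinate (assumed noise model).
    noisy = [a + rng.gauss(0.0, noise_std) for a in avg]

    updated = [wi - lr * gi for wi, gi in zip(w, noisy)]

    # Projection onto the origin-centred ball of the given radius is a norm rescaling.
    return clip(updated, radius)


rng = random.Random(0)
w_next = projected_dp_sgd_step(
    w=[0.0, 0.0],
    per_sample_grads=[[3.0, 4.0], [0.1, 0.0]],
    lr=0.5,
    clip_norm=1.0,
    noise_std=0.1,
    radius=0.2,
    rng=rng,
)
print(w_next)
```

The projection keeps every iterate inside a bounded set, which is the setting the abstract refers to as projected DP-SGD.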

artificial intelligence, machine learning, privacy loss, (15 more...)

arXiv:2305.09903

Country:

North America > United States > California > Santa Clara County > Santa Clara (0.04)
North America > Mexico > Mexico City > Coyoacan (0.04)
North America > Canada > Ontario > Hamilton (0.04)
Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.84)

Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification

Montalvo-Lezama, Ricardo, Montalvo-Lezama, Berenice, Fuentes-Pineda, Gibran

arXiv.org Artificial Intelligence, Mar-29-2023

In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually-curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection so as to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparatively transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer while requiring only 11.82% of its parameters and 0.81% of its FLOPS.

artificial intelligence, machine learning, representation, (16 more...)

arXiv:2210.07983

Country:

North America > Canada (0.04)
Oceania > Australia (0.04)
North America > United States > California (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Language statistics at different spatial, temporal, and grammatical scales

Sánchez-Puig, Fernanda, Lozano-Aranda, Rogelio, Pérez-Méndez, Dante, Colman, Ewan, Morales-Guzmán, Alfredo J., Pineda, Carlos, Torres, Pedro Juan Rivera, Gershenson, Carlos

arXiv.org Artificial Intelligence, Jul-26-2022

Statistical linguistics has advanced considerably in recent decades as data has become available. This has allowed researchers to study how statistical properties of languages change over time. In this work, we use data from Twitter to explore English and Spanish considering the rank diversity at different scales: temporal (from 3 to 96 hour intervals), spatial (from 3km to 3000+km radii), and grammatical (from monograms to pentagrams). We find that all three scales are relevant. However, the greatest changes come from variations in the grammatical scale. At the lowest grammatical scale (monograms), the rank diversity curves are most similar, independently of the values of other scales, languages, and countries. As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales, as well as on the language and country. We also study the statistics of Twitter-specific tokens: emojis, hashtags, and user mentions. These particular types of tokens show a sigmoid kind of behaviour as a rank diversity function. Our results are helpful to quantify aspects of language statistics that seem universal and what may lead to variations.
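
The abstract does not define rank diversity, so the sketch below assumes the definition commonly used in the rank-diversity literature: for each rank k, the number of distinct words that occupy rank k across the time slices, divided by the number of slices. The function names and the alphabetical tie-breaking rule are illustrative choices, not taken from the paper.

```python
from collections import Counter


def ranked_words(tokens):
    """Words ordered by descending frequency; ties are broken alphabetically."""
    counts = Counter(tokens)
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def rank_diversity(time_slices, max_rank):
    """Assumed definition: d(k) = distinct words seen at rank k / number of slices."""
    rankings = [ranked_words(tokens) for tokens in time_slices]
    diversity = []
    for k in range(max_rank):
        # Slices with fewer than k + 1 distinct words contribute nothing at rank k.
        words_at_k = {r[k] for r in rankings if k < len(r)}
        diversity.append(len(words_at_k) / len(rankings))
    return diversity


slices = [
    ["the", "the", "cat"],
    ["the", "the", "dog"],
    ["the", "a", "a"],
]
d = rank_diversity(slices, max_rank=2)
print(d)
```

In this toy input the top rank is held by two different words over three slices, while the second rank changes in every slice, so diversity grows with rank. Applying the same function to n-grams instead of single words would correspond to moving along the grammatical scale.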

artificial intelligence, social media, spatial reasoning, (15 more...)

arXiv:2207.00709

Country:

North America > Mexico > Mexico City > Mexico City (0.05)
Europe > Spain > Galicia > Madrid (0.04)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
(11 more...)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.40)
